41 research outputs found

    Evaluation and optimization of frequent association rule based classification

    Deriving useful and interesting rules from a data mining system is an essential and important task. Problems such as the discovery of random and coincidental patterns, or patterns with no significant value, and the generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness of rules generated by data mining algorithms is actively and constantly being examined and developed. This paper presents a systematic way to evaluate the association rules discovered by frequent itemset mining algorithms, combining common data mining and statistical interestingness measures, and outlines an appropriate sequence for their usage. The experiments are performed on a number of real-world datasets that represent diverse characteristics of data/items, and a detailed evaluation of the rule sets is provided. Empirical results show that, with a proper combination of data mining and statistical analysis, the framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable rules with high accuracy and coverage when used in the classification problem. Moreover, the results reveal important characteristics of mining frequent itemsets and the impact of the confidence measure on the classification task.
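
    The measures and pruning steps the paper combines are standard in the association rule literature; the following minimal Python sketch (with invented counts, not the authors' exact procedure) shows support, confidence and lift for a rule A -> B, together with a simple redundancy check of the kind such frameworks apply.

        # Minimal sketch (not the authors' exact procedure): common
        # interestingness measures for a rule A -> B, plus a redundancy check
        # in which a longer rule is dropped when a more general rule already
        # achieves at least the same confidence. Counts are illustrative.

        def measures(n, n_a, n_b, n_ab):
            support = n_ab / n
            confidence = n_ab / n_a
            lift = confidence / (n_b / n)   # lift > 1: positive association
            return support, confidence, lift

        n = 1000  # total transactions (assumed)
        # general rule: {bread} -> {butter}
        s1, c1, l1 = measures(n, n_a=300, n_b=400, n_ab=240)
        # specialized rule: {bread, jam} -> {butter}
        s2, c2, l2 = measures(n, n_a=100, n_b=400, n_ab=78)

        print(f"general:     support={s1:.2f} confidence={c1:.2f} lift={l1:.2f}")
        print(f"specialized: support={s2:.2f} confidence={c2:.2f} lift={l2:.2f}")
        if c2 <= c1:
            print("specialized rule adds no confidence -> prune as redundant")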

    Quality and interestingness of association rules derived from data mining of relational and semi-structured data

    Deriving useful and interesting rules from a data mining system is an essential and important task. Problems such as the discovery of random and coincidental patterns, or patterns with no significant value, and the generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness of rules generated by data mining algorithms is actively and constantly being examined and developed. As data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach, and it is important to establish how the existing statistical measures and constraint parameters can be effectively utilized and in what sequence. This thesis presents a systematic way to evaluate the association rules discovered by frequent, closed and maximal itemset mining algorithms, and by frequent subtree mining algorithms, including rules based on induced, embedded and disconnected subtrees. With reference to frequent subtree mining, a new direction is additionally explored based on the DSM approach, which preserves all information from a tree-structured database in a flat data format, consequently enabling the direct application of a wider range of data mining analyses/techniques to tree-structured data. The implications of this approach were investigated, and it was found that basing rules on disconnected subtrees can be useful for increasing the accuracy and the coverage rate of the rule set. A strategy is developed that combines data mining with statistical measurement techniques such as sampling, redundancy and contradiction checks, and correlation and regression analysis to evaluate the rules. This framework is then applied to real-world datasets that represent diverse characteristics of data/items. Empirical results show that, with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable high-accuracy rules. Moreover, the results reveal important characteristics of, and differences between, mining frequent, closed or maximal itemsets and mining frequent subtrees (including rules based on induced, embedded and disconnected subtrees), as well as the impact of the confidence measure on the prediction and classification task.
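
    The thesis contrasts frequent, closed and maximal itemsets; as a toy illustration of that distinction (not the thesis' mining algorithms), the sketch below enumerates all three on a four-transaction database.

        # Illustrative sketch of the frequent/closed/maximal distinction on a
        # toy transaction database. An itemset is closed if no proper superset
        # has the same support, and maximal if no proper superset is frequent.
        from itertools import combinations

        transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
        min_support = 2
        items = sorted(set().union(*transactions))

        def support(itemset):
            return sum(itemset <= t for t in transactions)

        frequent = {frozenset(c): support(frozenset(c))
                    for r in range(1, len(items) + 1)
                    for c in combinations(items, r)
                    if support(frozenset(c)) >= min_support}

        closed = [s for s in frequent
                  if not any(s < t and frequent[t] == frequent[s] for t in frequent)]
        maximal = [s for s in frequent if not any(s < t for t in frequent)]

        print("frequent:", {tuple(sorted(s)): n for s, n in frequent.items()})
        print("closed:  ", [tuple(sorted(s)) for s in closed])
        print("maximal: ", [tuple(sorted(s)) for s in maximal])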

    Irrelevant feature and rule removal for structural associative classification

    In the classification task, the presence of irrelevant features can significantly degrade the performance of classification algorithms, in terms of additional processing time, more complex models, and the likelihood that the models have poor generalization power due to the overfitting problem. Practical applications of association rule mining often suffer from the overwhelming number of rules that are generated, many of which are not interesting or not useful for the application in question. Removing rules comprised of irrelevant features can significantly improve the overall performance. In this paper, we explore and compare the use of a feature selection measure to filter out unnecessary and irrelevant features/attributes prior to association rule generation. The experiments are performed using a number of real-world datasets that represent diverse characteristics of data items. Empirical results confirm that by utilizing feature subset selection prior to association rule generation, a large number of rules with irrelevant features can be eliminated. More importantly, the results reveal that removing rules that hold irrelevant features improves the accuracy rate and the capability to retain the rule coverage rate of structural associative classification.
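
    The abstract does not name the feature selection measure used; the sketch below uses mutual information as an assumed stand-in to show the general filter step of ranking features against the class label and discarding low-scoring ones before rule generation.

        # Sketch of filtering irrelevant features before rule generation.
        # Mutual information is an assumed stand-in for the paper's measure;
        # the data and the 0.01 cut-off are invented for illustration.
        import numpy as np
        from sklearn.feature_selection import mutual_info_classif

        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, size=500)
        flip = rng.random(500) < 0.1                  # 10% label noise
        relevant = y ^ flip                           # feature correlated with the label
        irrelevant = rng.integers(0, 2, size=(500, 3))  # pure-noise features
        X = np.column_stack([relevant, irrelevant])

        scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
        keep = [i for i, s in enumerate(scores) if s > 0.01]  # assumed threshold
        print("scores:", np.round(scores, 3), "-> keep columns", keep)
        # Association rules would then be mined only over the kept columns.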

    A System Dynamics Simulation Model for Managing Human Error in Power Tools Industries

    In today's modern and competitive era, every organization faces situations in which work does not proceed as planned and must be delayed when problems occur, and human error is often cited as the culprit. Errors made by employees force them to spend additional time identifying and checking for the error, which in turn can affect the normal operations of the company as well as the company's reputation. Employees are a key element of the organization in running all of its activities; hence, employee work performance is a crucial factor in organizational success. The purpose of this study is to identify the factors that cause the increase in errors made by employees in the organization using a system dynamics approach. The broadly defined targets in this study are employees in the Regional Material Field team of the purchasing department in the power tools industry. Questionnaires were distributed to the respondents to obtain their perceptions of the root causes of errors made by employees in the company. A system dynamics model was developed to simulate the factors behind the increasing errors made by employees and their impact. The findings of this study show that the increase in errors made by employees was generally caused by the factors of workload, work capacity, job stress, motivation and performance of employees. This problem could, however, be solved by increasing the number of employees in the organization.
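
    The abstract names the model's factors (workload, work capacity, job stress) but not its equations; the following stock-and-flow sketch uses assumed relationships and constants to illustrate the kind of system dynamics simulation described, not the authors' calibrated model.

        # Minimal stock-and-flow sketch: backlog is the stock; completion and
        # error-driven rework are the flows. All equations and constants are
        # illustrative assumptions.
        dt, weeks = 0.25, 52
        employees = 10.0
        incoming_work = 120.0                 # tasks/week (assumed)
        backlog, history = 40.0, []

        for step in range(int(weeks / dt)):
            work_capacity = employees * 10.0          # tasks/week per employee (assumed)
            workload_ratio = backlog / work_capacity
            job_stress = min(1.0, workload_ratio)     # saturating stress proxy
            error_rate = 0.02 + 0.10 * job_stress     # errors per completed task
            completed = min(backlog / dt, work_capacity)
            rework = completed * error_rate           # errors feed work back in
            backlog += (incoming_work + rework - completed) * dt
            history.append((backlog, error_rate))

        print(f"final backlog={history[-1][0]:.1f} tasks, error rate={history[-1][1]:.3f}")
        # Raising `employees` lowers workload_ratio, stress and the error rate,
        # matching the intervention the study suggests.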

    Assessing stakeholder’s credit risk using data mining in construction project

    Nowadays, the rapid growth of the national and global economy demands efficient, innovative and cost-effective building and infrastructure projects. Partnering in construction projects is complex in nature due to human and non-human factor variables. For instance, credit capacity is a common attribute from the client's perspective when selecting partners for a construction project. However, the assessment of the credit risk capacity of partners (such as the project manager, quantity surveyor, consultant and contractor) is neglected, particularly in design-build projects in Malaysia. Due to unforeseen risks associated with credit capacity, project delays and cost overruns occur frequently in the Malaysian construction industry. Thus, this research aims to develop a framework for assessing credit risk using data mining for design-build projects. The study will employ a case study approach to gather information, develop a data mining model and validate it with real projects involving public clients. The framework will enable public clients to select appropriate partners for their construction projects with minimal risk. It is anticipated that this study will yield an efficient artifact to improve existing government procurement systems such as ePerolehan and e-Perunding.
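
    The proposal does not fix a particular data mining technique; as one hedged possibility, the sketch below trains a decision-tree classifier on invented partner attributes to illustrate the kind of credit risk model such a framework could produce.

        # Hedged sketch of a possible credit risk model: a decision tree over
        # hypothetical partner attributes. Features, labelling rule and data
        # are invented for illustration only.
        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)
        n = 300
        X = np.column_stack([
            rng.integers(1, 30, n),        # years_in_business (assumed feature)
            rng.uniform(0.0, 1.5, n),      # debt_ratio (assumed feature)
            rng.integers(0, 10, n),        # past_project_delays (assumed feature)
        ])
        # Assumed rule generating the "high credit risk" label for the demo.
        y = ((X[:, 1] > 0.8) & (X[:, 2] > 3)).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
        print("holdout accuracy:", round(model.score(X_te, y_te), 2))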

    A conceptual framework for predicting the effects of encroachment on magnitude of flood in Foma-river area, Kwara State, Nigeria using data mining

    Flooding occurs naturally, but on its own there is no flood hazard; it is only after human encroachment into the floodplain that flooding turns into a hazard. The practice of continuously increasing property development along the floodplain and indiscriminate refuse disposal into water channels have been major contributing factors to intensive flooding along the floodplain. This is a result of the decline in the capacity of the floodplain to absorb excess flooding, exposing more urban areas to flood vulnerability. Foma-river is located in Ilorin, Kwara, Nigeria, at latitude N08,49574 and longitude E004,5107. The climate of Ilorin comprises dry and wet seasons, with the wet season starting around March and lasting for about four to five months. This study proposes a conceptual framework to support the prediction of the effects of season on the magnitude of floods in the Foma-river area using a data mining approach, based on 7 years of sampled data from the Nigeria Meteorological Agency (NIMET) and questionnaire responses from residents along the Foma-river floodplain.

    Evaluation of machine learning classifiers in faulty die prediction to maximize cost scrapping avoidance and assembly test capacity savings in semiconductor integrated circuit (IC) manufacturing

    Semiconductor manufacturing is a complex and expensive process. Semiconductor packaging is trending towards more complex packages with higher performance and lower power consumption: silicon dies are manufactured using smaller fab process technology nodes, and packaging technology is becoming more complex and expensive. The packaging trend has evolved from single-die to multi-die packaging, which requires more processing steps and tools in the assembly process as well. All these factors cause the cost per unit to increase. Multi-die packaging also results in higher production yield loss compared to single-die packaging, because the overall yield is now the product of the yields of the individual dies. If any die in the final package is tested at Class and found to be faulty, not meeting the product specification, the whole package is scrapped even if the rest of the dies pass the tests. This results in wasted good raw material (good dies and a good substrate) and manufacturing capacity used to assemble and test the affected bad package. In this research work, a new framework is proposed for model training and evaluation in machine learning applications for semiconductor test, with the objective of using machine learning to screen bad dies before die attachment to the package. The model training flow has two classifier groupings, a control group and an auto machine learning (ML) group, and feature selection with a redundancy elimination method is applied to the input data to reduce the number of variables to a minimum prior to the modeling flow. The control group serves as a reference; the other group uses auto ML to run multiple classifiers automatically, and only the top 3 are selected for the next step. The performance metric used is the recall rate at a specified precision derived from the ROI break-even point. The threshold probability that corresponds to the fixed precision is set as the classifier threshold during model evaluation on unseen datasets. The model evaluation flow uses 3 different non-overlapping datasets, and the comparison of classifiers is based on recall rate and precision rate. This new framework provides the range of possible recall rates, from minimum to maximum, and identifies which classifier algorithm performs best for a given dataset. The selected model can be implemented in the actual manufacturing flow to screen predicted bad dies for maximum cost scrapping avoidance and capacity savings.
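
    The threshold-at-fixed-precision step can be made concrete: the sketch below (synthetic data and model; 0.90 is an assumed stand-in for the ROI break-even precision) picks the lowest probability cutoff whose precision meets the target and reports the recall achieved there.

        # Sketch of choosing the classifier threshold at a fixed precision and
        # reading off the recall, as the framework describes. Data, model and
        # the 0.90 target are assumptions for illustration.
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import precision_recall_curve
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(4000, 8))
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=4000) > 1.5).astype(int)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]

        target_precision = 0.90                       # assumed ROI break-even point
        precision, recall, thresholds = precision_recall_curve(y_te, proba)
        ok = np.where(precision[:-1] >= target_precision)[0]
        cutoff = thresholds[ok[0]]                    # lowest threshold meeting target
        print(f"threshold={cutoff:.3f} "
              f"precision={precision[ok[0]]:.3f} recall={recall[ok[0]]:.3f}")
        # Dies with predicted probability >= cutoff would be screened before die
        # attach; unseen datasets reuse this fixed cutoff, as in the framework.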

    An innovative data mining and dashboard system for monitoring of Malaysian dengue trends

    Monitoring dengue fever has become an important task in reducing dengue outbreak crises, as these monitoring tasks offer stakeholders such as the Ministry of Health Malaysia (MOH) a well-informed view of the status of dengue fever. There are abundant dengue cases reported in Malaysia, including mortality, recorded over the past year. Data from the Malaysian Open Data portal reveal that 21,900 cases of dengue fever were reported in 2012, with 35 deaths. However, this information is dispersed and circulated among several ministries and stakeholders: information regarding the dengue outbreak belongs to MOH, while information on population and density belongs to other stakeholders. Putting this information into one monitoring system requires an innovative system capable of extracting data and information from several databases and of summarizing these data into meaningful information. Given the dangerous effects of dengue fever, one solution is to implement an innovative forecasting and dashboard system for dengue spread in Malaysia, with emphasis on early prediction of dengue outbreaks. Importantly, this research delivers the message to health policy makers such as the Ministry of Health Malaysia (MOH), practitioners, and researchers of the importance of integrating their collaboration in exploring potential strategies to reduce the future burden of increasing dengue transmission cases in Malaysia.
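
    The forecasting component is not specified in the abstract; as a minimal illustration of the early-warning idea, the sketch below fits a linear trend to synthetic weekly case counts and raises an alert when the forecast crosses an assumed threshold.

        # Illustrative early-warning sketch. The weekly counts and the alert
        # threshold are synthetic assumptions, not MOH data.
        import numpy as np

        weekly_cases = np.array([210, 230, 228, 260, 290, 310, 355, 402])  # assumed
        weeks = np.arange(len(weekly_cases))

        slope, intercept = np.polyfit(weeks, weekly_cases, 1)   # linear trend
        next_week = slope * len(weekly_cases) + intercept

        outbreak_threshold = 380                                # assumed alert level
        print(f"forecast for next week: {next_week:.0f} cases")
        if next_week > outbreak_threshold:
            print("ALERT: forecast exceeds outbreak threshold -> flag on dashboard")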

    Statistical interestingness measures for XML-based association rules

    Recently, mining frequent substructures from XML data has gained a considerable amount of interest. Different methods have been proposed and examined for mining frequent patterns from XML documents efficiently and effectively. While many of the frequent XML patterns generated are useful and interesting, it is common that a large portion of them is not considered interesting or significant for the application at hand. In this paper, we present a systematic approach to ascertain whether the discovered XML patterns are significant and not just coincidental associations, and provide a precise statistical approach to support this framework. The proposed strategy combines data mining and statistical measurement techniques to discard the non-significant patterns. We consider the "Prions" database, which describes the protein instances stored for the Human Prion Protein, and apply the proposed unified framework to this dataset to demonstrate its effectiveness in assessing the interestingness of discovered XML patterns by statistical means. When the dataset is used for classification/prediction purposes, the proposed approach discards non-significant XML patterns without the cost of a reduction in the accuracy of the pattern set as a whole.
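
    As a hedged illustration of the statistical check (not the paper's exact procedure or its Prions data), the sketch below flattens tiny XML documents into item sets and applies a chi-square test to decide whether a pattern co-occurrence is significant rather than coincidental.

        # Flatten toy XML documents into item sets, then chi-square test a
        # candidate pattern. Documents and tags are invented for illustration.
        import xml.etree.ElementTree as ET
        from scipy.stats import chi2_contingency

        docs = [
            "<protein><species>human</species><motif>octarepeat</motif></protein>",
            "<protein><species>human</species><motif>octarepeat</motif></protein>",
            "<protein><species>mouse</species><motif>helix</motif></protein>",
            "<protein><species>human</species><motif>helix</motif></protein>",
        ] * 25  # repeated to get a workable sample size

        def items(doc):
            root = ET.fromstring(doc)
            return {f"{child.tag}={child.text}" for child in root}

        transactions = [items(d) for d in docs]
        a, b = "species=human", "motif=octarepeat"
        n = len(transactions)
        n_a = sum(a in t for t in transactions)
        n_b = sum(b in t for t in transactions)
        n_ab = sum(a in t and b in t for t in transactions)

        # 2x2 contingency table: [[both, A only], [B only, neither]]
        table = [[n_ab, n_a - n_ab], [n_b - n_ab, n - n_a - n_b + n_ab]]
        chi2, p, _, _ = chi2_contingency(table)
        print(f"pattern {a} & {b}: chi2={chi2:.1f}, p={p:.4g}")  # small p -> keep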

    Profiling Oman education data using data visualization technique

    This research work presents an innovative data visualization technique to understand and visualize the information in Oman's education data generated from the Ministry of Education Oman "Educational Portal". The Ministry of Education in the Sultanate of Oman has huge databases containing massive amounts of information, and the volume of data increases yearly as many students, teachers and employees are entered into the database. The task of discovering and analyzing these vast volumes of data becomes increasingly difficult. Information visualization and data mining offer better ways of dealing with large volumes of information. In this paper, an innovative information visualization technique is developed to visualize the complex multidimensional educational data; Microsoft Excel Dashboard, Visual Basic for Applications (VBA) and Pivot Tables are utilized to visualize the data. Findings from the summarization of the data are presented, and it is argued that information visualization can help related stakeholders become aware of the hidden and interesting information in the large amount of data residing in their educational portal.
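
    The paper builds its dashboards with Microsoft Excel, VBA and Pivot Tables; as an analogous sketch in Python (with invented records, not the ministry's data), the same kind of summarization can be expressed with a pandas pivot table.

        # Region-by-year roll-up of the kind a dashboard chart would use.
        # The records below are invented for illustration.
        import pandas as pd

        records = pd.DataFrame({
            "region": ["Muscat", "Muscat", "Dhofar", "Dhofar", "Muscat", "Dhofar"],
            "year":   [2012, 2013, 2012, 2013, 2013, 2013],
            "role":   ["student", "student", "student", "teacher", "teacher", "student"],
            "count":  [1200, 1350, 800, 60, 75, 880],
        })

        summary = records.pivot_table(index="region", columns="year",
                                      values="count", aggfunc="sum", fill_value=0)
        print(summary)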